Overview of Dataset

The dataset was obtained from Kaggle. It has 299 observations and 13 variables. The outcome variable ‘DEATH_EVENT’ indicates whether a patient died of heart failure, and 11 of the remaining variables serve as predictors. The variable names are listed after the note below.

NB: The 12th variable, ‘time’, records the time from the start of the study after which follow-up of the subject ended. Presumably, this could be because the subject was declared healthy, dropped out of the study for various reasons, or died from heart failure. That time would not be available in real-world use, when the resulting model predicts the outcome of a new case. To avoid target leakage, ‘time’ is therefore not used as a feature to train the model.

## Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
##        'ejection_fraction', 'high_blood_pressure', 'platelets',
##        'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
##        'DEATH_EVENT'],
##       dtype='object')
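A minimal sketch of how the predictors and the outcome could be separated while leaving out ‘time’. The small frame is made-up stand-in data, not rows from the Kaggle file, and `split_features_target` is a hypothetical helper name.

```python
import pandas as pd

def split_features_target(df, target="DEATH_EVENT", leaky=("time",)):
    """Return (X, y), dropping the target and any leakage-prone columns from X."""
    X = df.drop(columns=[target, *leaky])
    y = df[target]
    return X, y

# Tiny made-up frame that only mimics a few of the dataset's column names.
toy = pd.DataFrame({
    "age": [60, 75, 50],
    "ejection_fraction": [38, 20, 60],
    "time": [4, 6, 250],
    "DEATH_EVENT": [1, 1, 0],
})

X, y = split_features_target(toy)
print(list(X.columns))
```

Dropping ‘time’ at this point keeps it out of every later step, including resampling and feature selection.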

Brief Exploratory Data Analysis

We can take a quick look at the only two demographic variables in the dataset: age and sex. From the output below, the age of the patients ranges from 40 to 95 years, with a median of 60 years and a mean of approximately 61 years.

## count    299.000000
## mean      60.833893
## std       11.894809
## min       40.000000
## 25%       51.000000
## 50%       60.000000
## 75%       70.000000
## max       95.000000
## Name: age, dtype: float64

Pre-selection of Features and Feature Engineering

We are closer to our goal of comparing the performance of various ML models on the dataset. Features here are pre-selected based on domain knowledge. First, let us check our outcome variable. In our dataset, the proportion of “No” examples for our outcome variable is much higher than that of “Yes” examples. The main challenge with an imbalanced dataset is how accurately the ML model predicts both the majority and the minority class. There is therefore a danger of our ML algorithms being biased if trained on this data, as they would have far more “No” examples to learn from.
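One quick way to check the balance of the outcome is to compute the share of each class. The labels here are made up for illustration and do not reproduce the real class counts.

```python
import pandas as pd

# Made-up labels for illustration only (0 = "No", 1 = "Yes").
y_toy = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="DEATH_EVENT")

# Share of each class; a large gap between the two signals imbalance.
class_share = y_toy.value_counts(normalize=True)
print(class_share)
```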

We address this imbalance with the Synthetic Minority Oversampling Technique (SMOTE). SMOTE uses a k-nearest-neighbour algorithm to generate synthetic minority examples, which helps to overcome the overfitting problem that might occur if we use random oversampling. I chose SMOTE instead of random undersampling of the majority class because I want to preserve the data and not eliminate any examples, since I do not have much training data to begin with.
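The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified hand-rolled illustration, not the implementation used for this analysis (a library version such as `SMOTE` from imbalanced-learn would be the usual choice). The name `smote_sketch`, its parameters and the toy minority rows are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating towards neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbours than other rows
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))         # pick a minority row
        nb = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                      # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Made-up minority-class rows with two features.
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_rows = smote_sketch(X_minority, n_new=6, k=2)
print(new_rows.shape)
```

Each synthetic row lies on the line segment between a real minority row and one of its neighbours, so the new examples are not exact copies, which is the contrast with random oversampling.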

Feature Selection

First, we identify features with low variance and de-select them, since they would not help the model much in finding patterns. We will also check for multicollinearity amongst the features and de-select one feature from each highly correlated pair.

VarianceThreshold(threshold=0.15)
## array([ True,  True,  True,  True,  True,  True,  True,  True])

From the results, all eight features have a variance above our threshold of 0.15, so none is removed. For a binary feature, this threshold corresponds to no single value covering more than roughly 82% of the rows.
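As a small illustration of how this filter behaves, the sketch below fits scikit-learn's `VarianceThreshold` with the same threshold on made-up data in which one column is almost constant. The array is hypothetical and unrelated to the real features.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: column 0 varies a lot, column 1 is almost constant.
X_toy = np.array([
    [40.0, 0.0],
    [55.0, 0.0],
    [70.0, 0.0],
    [95.0, 1.0],
    [60.0, 0.0],
    [65.0, 0.0],
    [50.0, 0.0],
    [80.0, 0.0],
])

selector = VarianceThreshold(threshold=0.15)
selector.fit(X_toy)

# True marks a feature whose variance exceeds the threshold and is kept.
keep_mask = selector.get_support()
print(keep_mask)
```

The second column takes the value 1 in only one of eight rows, giving a variance of about 0.11, so it falls below the threshold and would be dropped.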

##                                age   anaemia  ...  high_blood_pressure   smoking
## age                       1.000000  0.046890  ...             0.041698 -0.032009
## anaemia                   0.046890  1.000000  ...             0.067441 -0.070495
## creatinine_phosphokinase -0.124283 -0.183755  ...            -0.051252  0.023303
## diabetes                 -0.160681  0.024540  ...             0.017428 -0.074121
## ejection_fraction         0.079462  0.049630  ...             0.036467 -0.028121
## serum_sodium             -0.056309  0.034433  ...             0.092173  0.061729
## high_blood_pressure       0.041698  0.067441  ...             1.000000  0.019311
## smoking                  -0.032009 -0.070495  ...             0.019311  1.000000
## 
## [8 rows x 8 columns]

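The pairwise correlations shown are all weak (at most about 0.18 in absolute value), so we find no collinearity amongst the variables and keep all of them.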
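A check like this can be automated by scanning the upper triangle of the correlation matrix for large coefficients. In the sketch below, `highly_correlated_pairs`, the cutoff of 0.8 and the toy frame are assumptions for illustration; the post does not state which cutoff was applied.

```python
import pandas as pd

def highly_correlated_pairs(df, cutoff=0.8):
    """List (feature_a, feature_b, r) for pairs with |r| above the cutoff."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > cutoff:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Made-up frame: "b" is nearly a multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
flagged = highly_correlated_pairs(toy)
print(flagged)
```

An empty list would mean that no pair exceeds the cutoff, which is the situation described above.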

Now, we’re going to use SequentialFeatureSelector (SFS) from the mlxtend library, a Python library of data science tools. SFS is a greedy procedure: at each iteration, it adds to the features already selected the one new feature that gives the best cross-validation score. For forward selection, we start with 0 features and choose the single feature with the highest score, then repeat the procedure until we reach the desired number of selected features. Here we pass a range, k_features=(1, 8), so the selector returns the feature subset, of any size from 1 to 8, with the best cross-validation performance.

Sequential Forward Selection

SequentialFeatureSelector(estimator=LogisticRegression(max_iter=1000),
                          k_features=(1, 8), scoring='accuracy')
## ('age', 'anaemia', 'diabetes', 'ejection_fraction', 'serum_sodium')
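To make the greedy procedure concrete, here is a hand-written sketch of forward selection built on scikit-learn's `cross_val_score`. It is not mlxtend's implementation. The function name `forward_select`, the choice of 5 folds and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(estimator, X, y, max_features, cv=5):
    """Greedy forward selection; returns the best-scoring subset seen and its score."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_subset, best_score = [], -np.inf
    while remaining and len(selected) < max_features:
        # Score every candidate feature when added to the current subset.
        scores = {
            f: cross_val_score(estimator, X[:, selected + [f]], y,
                               cv=cv, scoring="accuracy").mean()
            for f in remaining
        }
        winner = max(scores, key=scores.get)
        selected.append(winner)
        remaining.remove(winner)
        # Remember the best subset over all sizes, like a (min, max) range.
        if scores[winner] > best_score:
            best_subset, best_score = list(selected), scores[winner]
    return best_subset, best_score

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
subset, score = forward_select(LogisticRegression(max_iter=1000),
                               X_demo, y_demo, max_features=6)
print(sorted(subset), round(score, 3))
```

Because the search is greedy, a feature that has been added is never removed, so the result is not guaranteed to be the best subset overall.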

Let’s do a little experiment and see which features are selected when we use the raw data before SMOTE was applied:

SequentialFeatureSelector(estimator=LogisticRegression(max_iter=1000),
                          k_features=(1, 8), scoring='accuracy')
## ('age', 'diabetes', 'ejection_fraction', 'high_blood_pressure')

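The results are different: the raw data yields four features, and only ‘age’, ‘diabetes’ and ‘ejection_fraction’ appear in both selections. We see the importance of data pre-processing and feature engineering before rushing ahead with machine learning.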

We will go with the features selected from the SMOTE-transformed data.

Next, we create a dictionary with each model as the key and its accuracy, precision and recall (in percent) as the values.

## {'LogisticRegression(max_iter=1000)': [65.85365853658537, 65.3061224489796, 74.4186046511628], 'SVC()': [67.07317073170732, 66.0, 76.74418604651163], 'KNeighborsClassifier()': [63.41463414634146, 65.11627906976744, 65.11627906976744], 'DecisionTreeClassifier()': [64.63414634146342, 66.66666666666666, 65.11627906976744], 'RandomForestClassifier()': [71.95121951219512, 72.72727272727273, 74.4186046511628], 'GradientBoostingClassifier()': [73.17073170731707, 74.4186046511628, 74.4186046511628]}
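A dictionary of this shape could be built with a loop like the one sketched below. The synthetic data, the 80/20 split and the variable names are assumptions for illustration, so the numbers it prints will not match the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the resampled training data (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

models = [LogisticRegression(max_iter=1000), SVC(), KNeighborsClassifier(),
          DecisionTreeClassifier(), RandomForestClassifier(),
          GradientBoostingClassifier()]

results = {}
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Key by the model's repr; values are [accuracy, precision, recall] in percent.
    results[str(model)] = [
        100 * accuracy_score(y_test, y_pred),
        100 * precision_score(y_test, y_pred),
        100 * recall_score(y_test, y_pred),
    ]
print(results)
```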

We then convert the dictionary into a dataframe for better visual exploration of the models and their metrics.

##                                index   Accuracy  Precision     Recall
## 0  LogisticRegression(max_iter=1000)  65.853659  65.306122  74.418605
## 1                              SVC()  67.073171  66.000000  76.744186
## 2             KNeighborsClassifier()  63.414634  65.116279  65.116279
## 3           DecisionTreeClassifier()  64.634146  66.666667  65.116279
## 4           RandomForestClassifier()  71.951220  72.727273  74.418605
## 5       GradientBoostingClassifier()  73.170732  74.418605  74.418605
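The conversion can be sketched with `pandas.DataFrame.from_dict`. Only two entries from the dictionary printed earlier are used here, rounded for brevity, and the variable names are hypothetical.

```python
import pandas as pd

# Two entries taken from the metrics dictionary, rounded for brevity.
metrics = {
    "SVC()": [67.07, 66.00, 76.74],
    "RandomForestClassifier()": [71.95, 72.73, 74.42],
}

# One row per model; reset_index() moves the model names into an "index" column.
metrics_df = pd.DataFrame.from_dict(
    metrics, orient="index", columns=["Accuracy", "Precision", "Recall"]
).reset_index()
print(metrics_df)
```

With the metrics in a dataframe, the models can be ranked directly, for example with `metrics_df.sort_values("Accuracy", ascending=False)`.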